Libraries
rm(list = ls())
library(tidyverse)
library(igraph)
library(tidygraph)
library(ggplot2)
library(ggraph)
library(ggthemes)
library(here)
library(caret)
library(purrr)
library(DataExplorer)
library(GGally)
library(glmnet)
#library(mice) # ended up not using imputation
library(conflicted)
#avoid conflicts in packages
conflicted::conflict_prefer("filter", "dplyr")
conflicted::conflict_prefer("select", "dplyr")
conflicted::conflict_prefer("simplify", "igraph")
Read the data
Description of the dataset
This network is a directed and unweighted social communication network, which represents the exchange of emails among members of the Rovira i Virgili University in Spain, in 2003.
Source: Netzschleuder.
Interpreting the email dataset as a SIR model
To help interpret the findings within the context of the university email network, we explain the following settings. We can consider the epidemic as the spread of rumors between individuals. For example, it could be some sort of negative campaign about a Professor to ruin their reputation (but never a Social Networks teacher obviously - don’t worry Blas, you’re safe :D).
We can cast the actors in the model as follows:
Susceptible (S): Individuals who have not yet received the email
Infected (I): Individuals who have already received the bad email and joined the campaign. They can now send the rumors on to other individuals in the university
Recovered (R): Individuals who have already received the bad email and were infected, but now realize that it’s just a rumor and are NOT SUSCEPTIBLE of believing the email if they see it again AND will not send it to others if they see it. They are immune to the rumor.
Nodes and links
## # A tibble: 6 × 3
## `# index` name `_pos`
## <dbl> <dbl> <chr>
## 1 0 1 array([-7.40536013, 2.36294663])
## 2 1 2 array([-7.44612057, 2.39157152])
## 3 2 3 array([-7.3831753 , 2.37534173])
## 4 3 4 array([-7.41577001, 2.44331821])
## 5 4 5 array([-7.35737624, 2.33315794])
## 6 5 6 array([-7.37931506, 2.36491993])
## # A tibble: 6 × 2
## `# source` target
## <dbl> <dbl>
## 1 0 1
## 2 0 2
## 3 0 3
## 4 0 4
## 5 0 5
## 6 0 6
Within the nodes dataframe, the column
# index indicates the index of the nodes. The
column name would contain the corresponding name of each
node, but the names have been anonymized for privacy reasons, so this column is
just another series of numbers. The _pos column holds the
coordinates of each node in 2D space, used for visualization or layout
in graph-related tasks.
In the links dataframe, the column # source
and target indicate the indexes of the 2 nodes forming a
link.
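The construction of the igraph object from these two tables is not shown above; here is a minimal sketch of the idea, using tiny stand-in data frames with simplified column names (in the real analysis the tables would come from the Netzschleuder CSVs, e.g. via read_csv and here(), with the `# index` / `# source` headers shown above):

```r
library(igraph)

# Tiny stand-ins for the real nodes/links tables;
# the actual data would be read from the Netzschleuder CSV files
nodes <- data.frame(index = 0:3, name = 1:4)
links <- data.frame(source = c(0, 0, 1), target = c(1, 2, 2))

# graph_from_data_frame matches edge endpoints against the first column
# of the vertices table, so the index column goes first (as character)
g <- graph_from_data_frame(
  d = data.frame(from = as.character(links$source),
                 to   = as.character(links$target)),
  directed = TRUE,
  vertices = data.frame(id    = as.character(nodes$index),
                        label = nodes$name)
)
vcount(g)  # 4
ecount(g)  # 3
```

Any extra columns in the vertices data frame (like `label` here, or `_pos` in the real data) become vertex attributes automatically.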
Graph
## IGRAPH 69c7b7c DN-- 1133 10903 --
## + attr: name (v/n), _pos (v/c)
## + edges from 69c7b7c (vertex names):
## [1] 1-> 2 1-> 3 1-> 4 1-> 5 1-> 6 1-> 7 1-> 8 1-> 9 1->10 1->11 1->12 1->13
## [13] 1->14 1->15 1->16 1->17 1->18 1->19 1->20 1->21 1->22 1->23 1->24 1->25
## [25] 1->26 1->27 1->28 1->29 1->30 1->31 2->18 2->32 2->13 2->16 2->10 2->24
## [37] 2->33 2->34 2->22 2-> 1 2->35 2-> 3 2-> 9 2->20 2->36 2-> 7 2-> 6 2-> 8
## [49] 2->37 2->11 2-> 4 2->38 2->19 3->39 3->21 3->40 3->13 3-> 2 3-> 1 3-> 7
## [61] 3-> 8 3->34 3->28 3->23 3->41 3->42 3->43 3-> 9 3->11 3-> 4 3->18 3->19
## [73] 3->44 3->16 3->45 3->46 3-> 6 3->20 3->27 3->22 3->47 3->48 3->49 3->50
## [85] 3->12 3->51 3->52 3->53 3->54 3->55 3->56 4->57 4->58 4->59 4-> 1 4->60
## + ... omitted several edges
Initial graph exploration
To understand the graph, we can plot the degree distribution to check whether we have the expected power-law distribution or another type. This will help to interpret the findings later when we remove links.
First, we simplify the graph to remove a single self-loop, then count the edges and nodes.
paste0("The number of recursive (self-directed) emails in the network is: ",
ecount(graph) - ecount(simplify(graph)))## [1] "The number of recursive (self-directed) emails in the network is: 1"
# we see that one exists, so we simplify the graph to remove it.
graph <- simplify(graph)
paste0("There are ", vcount(graph), " nodes and ", ecount(graph), " edges in our Spanish email network")## [1] "There are 1133 nodes and 10902 edges in our Spanish email network"
## [1] TRUE
Next, calculate basic descriptive statistics of the graph:
#degree
degree_all <- degree(graph, mode="all")
degree_out <- degree(graph, mode="out")
paste0("The average degree is: ", round(mean(degree(graph)), 2), " when we DO NOT consider the direction of the emails.")## [1] "The average degree is: 19.24 when we DO NOT consider the direction of the emails."
paste0("The average degree is: ", round(sum(degree_out)/vcount(graph), 2), " when we DO consider the emails to be directed outward from the node (sender) to a recipient")## [1] "The average degree is: 9.62 when we DO consider the emails to be directed outward from the node (sender) to a recipient"
# most connected node:
paste0("The most connected node is number ", which.max(degree(graph))[1], " which has ", max(degree(graph)), " links.")## [1] "The most connected node is number 105 which has 142 links."
For the purposes of this homework, we are not factoring in the directional property of the graph. In fact, we will remove the directionality later to create a social network where the rumor can spread both ways.
Now check the degree distribution:
# set average node for label
avg_deg <- round(mean(degree(graph)),2)
# plot linear distribution
ggplot() +
geom_histogram(aes(x = degree_all), bins=40) +
geom_vline(aes(xintercept = avg_deg, colour = "red"), show.legend = FALSE) +
annotate("text", x = avg_deg+5, y = 160, label = paste0("Avg deg = ", avg_deg), angle = 90, color = "red") +
labs(title = "Degree distribution histogram for Uni emails",
x = "Degrees",
y = "Number of nodes") +
theme(axis.title.y = element_text(angle = 90))
We see the common power-law distribution: a few very connected nodes and many less connected ones. The maximum is node 105, reported earlier, with its 142 links, far more than the second most connected node.
Visualization of the network:
# set colour and size only for the top 50 nodes
top_50_nodes <- order(degree_all, decreasing = TRUE)[1:50]
V(graph)$color <- "grey3"
V(graph)[top_50_nodes]$color <- "red"
V(graph)$size <- 0.75
V(graph)[top_50_nodes]$size <- 2
# plot the full network
ggraph(graph, layout = "kk") +
geom_edge_link(width = 0.1, alpha = 0.5, color = "grey") +
geom_node_point(color = V(graph)$color, size = V(graph)$size) +
theme_void() +
ggtitle("Visualisation of the network, 50 most connected nodes highlighted")+
theme(legend.position = "none")
We see that the most connected nodes are all clustered in the center of the graph, while nodes with only a couple of connections sit on the outside.
Steps
1) Find the theoretical epidemic threshold βc for your network, i.e. the infection rate needed for the information to reach a significant number of nodes.
The formula to calculate the theoretical epidemic threshold that we learned from class is:
\[ \beta_c = \mu \frac{\langle k \rangle}{\langle k^2 \rangle}\]
where:
μ (mu): Recovery rate — typically set to 0.1 in SIR models.
⟨k⟩ (mean degree): The average number of connections (edges) per node in the network.
⟨k²⟩ (mean squared degree): The average of the square of each node’s degree — it captures how much variability (or inequality) there is in node connectivity.
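As a quick sanity check on the formula (on a toy graph, not the assignment data): in a k-regular graph every node has degree k, so ⟨k²⟩ = ⟨k⟩² and the threshold reduces to βc = μ/k:

```r
library(igraph)

mu <- 0.1
# Every node of a k-regular graph has degree k, so <k^2> = k^2
# and beta_c = mu * <k> / <k^2> collapses to mu / k
g_reg <- sample_k_regular(no = 100, k = 4)
k <- degree(g_reg)
beta_c_reg <- mu * mean(k) / mean(k^2)
beta_c_reg  # 0.1 / 4 = 0.025
```

On heterogeneous graphs, ⟨k²⟩ grows much faster than ⟨k⟩², which is exactly what pushes the threshold below μ/⟨k⟩.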
Updating our network to be undirected
Firstly, we make our network undirected. This creates a truer social network where spreading is symmetric between nodes and can happen in either direction.
We use the as_undirected function from igraph with mode="collapse" to update the graph. The effect of this is:
- It merges reciprocal edges (A→B and B→A) into one link, which matches the idea of a “social tie”.
With collapse, the number of edges drops by exactly half here (10902 to 5451), which means every email link in this network is reciprocated. This better reflects the true number of relationships.
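A two-node illustration of the collapse behaviour (toy graph, not the email data):

```r
library(igraph)

# A and B email each other: two directed edges, one underlying social tie
g_dir <- make_graph(c("A", "B", "B", "A"), directed = TRUE)
ecount(g_dir)  # 2
g_und <- as_undirected(g_dir, mode = "collapse")
ecount(g_und)  # 1
```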
# Create the undirected version of the graph
graph <- as_undirected(graph, mode = "collapse")
paste0("There are ", vcount(graph), " nodes and ", ecount(graph), " edges in our undirected Spanish email network.")## [1] "There are 1133 nodes and 5451 edges in our undirected Spanish email network."
Now we start to calculate epidemic threshold using degree moments:
# Set recovery rate (mu) — use the standard 0.1 value
mu <- 0.1
# Calculate degree of each node
k <- degree(graph)
# Mean degree and mean squared degree
mean_k <- mean(k)
mean_k2 <- mean(k^2)
# Calculate threshold with our formula: βc = μ * <k> / <k^2>
beta_c <- mu * mean(degree(graph)) / (mean(degree(graph)^2))
# Output the result
paste0("Theoretical epidemic threshold βc is approximately: ", round(beta_c, 5),
" because ⟨k²⟩ = ", round(mean_k2, 4), " > ⟨k⟩ = ", round(mean_k, 4))## [1] "Theoretical epidemic threshold βc is approximately: 0.00535 because ⟨k²⟩ = 179.8164 > ⟨k⟩ = 9.6222"
The estimated threshold βc ≈ 0.00535 is very low, indicating that even a small infection rate β would allow the information to reach a significant portion of the network.
This low threshold is a direct consequence of the high variance in degree:
⟨k⟩ = 9.62 but ⟨k²⟩ = 179.82, which suggests the presence of a few nodes with very high degree (i.e., “super spreaders”).
The large gap between ⟨k⟩ and ⟨k²⟩ is a signature of heterogeneity in the network. This is typical in real-world social networks, where most people have few contacts, and a few individuals are highly connected.
2) Assuming 1% of nodes are randomly selected as initial spreaders, simulate the SIR model below and above that threshold and plot the number of infected people as a function of β.
Here we simulate the SIR model with different β values. In the model, the values represent:
- 0 = S, 1 = I, 2 = R
The broad process is:
Set the inputs - graph, threshold, recovery rate & seeds.
Initialise all nodes as 0 in period 0, except infected which are the randomly assigned seeds.
Create output table with time and infection period
Loop while there are still infected people, adding new infections to the table in the period they became infected.
All nodes start as value 0 as they’re uninfected. We then apply infection to the randomly selected seeds and create the function for the SIR model:
# Set recovery rate μ
mu <- 0.1
# SIR simulation function
sim_sir <- function(g, beta, mu, seeds){
state <- rep(0, vcount(g))
state[seeds] <- 1 # set initial spreaders
t <- 0
table <- data.frame(t=0, inf=seeds)
while(sum(state == 1) > 0){
t <- t + 1
new_inf <- c()
for(i in which(state == 1)){ # among all infected ones
neighbors <- neighbors(g, i)
for(j in neighbors){
if(state[j] == 0 && runif(1) < beta){
# If neighbor j is susceptible (i.e., state[j] == 0), then infect it with probability β. "runif(1) < beta" means to generate a random number between 0 and 1. If it’s less than β, the infection is considered successful.
new_inf <- c(new_inf, j)
}
}
if(runif(1) < mu){ # Recover an infected node with probability μ
state[i] <- 2
}
}
new_inf <- unique(new_inf) # fix: avoid duplicates from multiple infected neighbors
state[new_inf] <- 1
if(length(new_inf) > 0){ # only record when there is new infection
table <- rbind(table, data.frame(t=t, inf=new_inf))
}
}
distinct(table) # adding distinct function to keep only unique nodes at each stage, avoiding repeated counts from infections to multiple neighbours in the same time period
}
Now we run the simulations over different values. Since our critical threshold was 0.00535, we select 3 values below the threshold, the threshold itself, and 3 above it.
# set the thresholds, β, to test (lower than, close to, or higher than βc)
beta_vals <- c(0, 0.002, 0.005, beta_c, 0.01, 0.02, 0.1)
# keep recovery rate (mu) at 0.1
mu <- 0.1
# randomly set 1% as seeds - this is 12 nodes (of our 1133 total)
set.seed(123)
num_seeds <- ceiling(0.01 * vcount(graph))
seeds <- sample(1:vcount(graph), num_seeds)
# run n times of simulation for each β. Calculate infection total for each.
n_runs <- 10
results <- data.frame()
for(beta in beta_vals){
final_counts <- c()
for(i in 1:n_runs){
sim <- sim_sir(graph, beta, mu, seeds)
final_counts <- c(final_counts, length(unique(sim$inf)))
}
avg_inf <- mean(final_counts)
sd_inf <- sd(final_counts)
results <- rbind(results, data.frame(beta=beta, avg_infected=avg_inf, sd=round(sd_inf,2)))
}
# and view the results:
results## beta avg_infected sd
## 1 0.000000000 12.0 0.00
## 2 0.002000000 15.0 2.21
## 3 0.005000000 31.8 16.63
## 4 0.005351148 18.2 6.18
## 5 0.010000000 298.9 150.22
## 6 0.020000000 635.3 38.37
## 7 0.100000000 1008.4 13.38
Now we can plot the results to compare more easily:
# viz:β vs averaged number of infected people
ggplot(results, aes(x=beta, y=avg_infected)) +
geom_line(color="steelblue") +
geom_point(size=3) +
geom_errorbar(aes(ymin=avg_infected - sd,
ymax=avg_infected + sd),
width=0.001, alpha=0.6) +
labs(title = "Infection Size vs β (Random 1% Seeds)",
x = expression(beta),
y = "Average number of infected nodes") +
theme_minimal()
The plot clearly demonstrates the presence of an epidemic threshold in the network. When the infection rate β is below approximately 0.00535, the average number of infected nodes remains low, indicating limited or failed spread. However, just beyond this value, there is a sudden and sharp increase in the number of infections, revealing a nonlinear transition typical of epidemic processes in complex networks.
The number of infections was most variable around β = 0.01. This indicates that close to this tipping point, small, random things – like exactly where the infection starts or how people are connected to each other – can make a huge difference in how many people end up getting infected. This is an important consideration through this analysis - which often uses 1% starting infection rates.
Overall, the plot illustrates a classic epidemic threshold phenomenon. Once β exceeds the critical value, even a small increase in the infection probability can lead to a disproportionately large outbreak, confirming the nonlinear nature of spreading processes in heterogeneous networks.
Now we plot the general trend to try to identify smoothness (note that this plot only has 1 run per point, so it could be considered less robust than the previous one).
# set the seeds for consistency in the starting infection nodes:
set.seed(123)
num_seeds <- ceiling(0.01 * vcount(graph))
seeds <- sample(1:vcount(graph), num_seeds)
# plot the outputs over time
results_q2 <- map_dfr(seq(0,0.1,0.005), # beta range
\(beta){
realization <- sim_sir(graph,beta,mu,seeds) # make a realization for each beta
data.frame(beta,
ninf=nrow(realization)) # create dataframe column for beta and for number infected
})
results_q2 %>% ggplot(aes(x=beta,y=ninf))+
geom_point()+
geom_line(color="red", alpha=0.5)+
geom_vline(xintercept = beta_c,linetype=2)+
labs(title = "Comparing total infections to beta values")
We see the trend holds. There is a very sharp increase around the critical threshold. Note that this value is very low and the increments between each point in the plot are also small, each representing a 0.5% increase in beta (\(\beta = 0.005\)).
With such a low critical threshold, we are very likely to end up with an epidemic in this model. Any growth is likely to lead to an epidemic. This is seen because our critical beta value is below 1% (around 0.535%). This means that in the following questions, when we select any 1% sample, whether it is random or targeted, it will likely lead to an epidemic.
To illustrate this, we compute the basic reproduction number R₀ for transmission probabilities of 1% and 2% (β = 0.01 and β = 0.02):
mu <- 0.1 # Recovery rate
# Calculate R₀ options a
R0_1 <- (0.01 / mu) * (mean_k2 / mean_k)
R0_2 <- (0.02 / mu) * (mean_k2 / mean_k)
# print result
paste0("Basic reproduction number R₀ ≈ ", round(R0_1, 3), " with a 1% infection rate. Or, ", round(R0_2, 3), " with a 2% infection rate.")## [1] "Basic reproduction number R₀ ≈ 1.869 with a 1% infection rate. Or, 3.738 with a 2% infection rate."
R₀ > 1 means the epidemic is likely to spread. We obtained an R₀ of 3.738, i.e. very likely to spread, even though the infection probability per contact per time step is only 2%.
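Note that with these definitions R₀ is simply β/βc, which gives a quick way to sanity-check the numbers above (moments reproduced from the values reported earlier):

```r
# Degree moments and recovery rate reported in the analysis above
mean_k  <- 9.6222
mean_k2 <- 179.8164
mu      <- 0.1

beta_c <- mu * mean_k / mean_k2            # epidemic threshold
R0_1   <- (0.01 / mu) * (mean_k2 / mean_k) # R0 at beta = 0.01

# R0 = (beta/mu)*(<k^2>/<k>) and beta_c = mu*<k>/<k^2>, hence R0 = beta/beta_c
all.equal(R0_1, 0.01 / beta_c)  # TRUE
```

So β = 0.01 giving R₀ ≈ 1.87 is just another way of saying β is about 1.87 times the critical threshold.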
Seems like these rumors will be spreading in the university!
3) Choose a β well above βc. Using centrality, communities or any other suitable metric, find a better set of 1% of seeds in the network so we get more infected people than the random case. Measure the difference of your choice with the random case as:
a) The difference in the total number of infected people
First, we set our very high β. Since our actual threshold βc is VERY low (<0.01), even a modest-looking beta value is very high relative to it.
We select β = 0.15, roughly 28 times the critical value.
We compare the random seeds against degree centrality and PageRank, pitting a simpler measure against a richer one. The two should be correlated, but we include both to see whether the richer PageRank calculation identifies nodes that lead to more rapid spreading.
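To support the claim that the two measures are correlated, a quick check on a toy preferential-attachment graph (not the email network) shows that degree and PageRank rank nodes very similarly:

```r
library(igraph)

set.seed(1)
# Toy scale-free graph; on such graphs degree and PageRank are strongly
# rank-correlated, which is why the two seed sets overlap heavily
g_toy <- sample_pa(1000, directed = FALSE)
r <- cor(degree(g_toy), page_rank(g_toy)$vector, method = "spearman")
r  # typically well above 0.9 on preferential-attachment graphs
```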
# set the comparatively high beta and keep our same recovery rate.
beta_high <- 0.15
mu <- 0.1
num_seeds <- ceiling(0.01 * vcount(graph))
## Collect each type of sample to compare the outcomes
# 1. Random selection (baseline)
set.seed(123)
random_seeds <- sample(1:vcount(graph), num_seeds)
# 2. Degree centrality
degree_cent <- degree(graph)
degree_seeds <- order(degree_cent, decreasing = TRUE)[1:num_seeds]
# 3. PageRank
pagerank_cent <- page_rank(graph)$vector
pagerank_seeds <- order(pagerank_cent, decreasing = TRUE)[1:num_seeds]
# check the two centrality measures don't select identical seed sets, for redundancy
identical(degree_seeds, pagerank_seeds)## [1] FALSE
From this, we have our 1% of central nodes (which is equal to 12 nodes in our data) and we now compare the random selection vs targeted selection to see how rapidly the epidemic can spread.
## run simulations based on each of the selected and random nodes
random_sim <- sim_sir(graph, beta_high, mu, random_seeds)
degree_sim <- sim_sir(graph, beta_high, mu, degree_seeds)
pagerank_sim <- sim_sir(graph, beta_high, mu, pagerank_seeds)# we use the unique option to ensure we have properly accounted for any duplicates (although this appears to function properly)
rand_num_inf <- length(unique(random_sim$inf))
deg_num_inf <- length(unique(degree_sim$inf))
pager_num_inf <- length(unique(pagerank_sim$inf))
Print all of our results for total infections. They are relatively similar at this high value, which is not surprising.
Of the 1,133 people in the university network, we see that nearly all are infected in each model. This is because the infection rate is far above our critical threshold, leading to long-run epidemics.
cat(
"Of the ", vcount(graph), " nodes. We see the following under each setting:\n",
"Random selection: ", rand_num_inf, " (equal to ", round(rand_num_inf/vcount(graph)*100,2), "%)\n",
"Degree centrality: ", deg_num_inf, " (equal to ", round(deg_num_inf/vcount(graph)*100,2), "%)\n",
"PageRank: ", pager_num_inf, " (equal to ", round(pager_num_inf/vcount(graph)*100,2), "%)")## Of the 1133 nodes. We see the following under each setting:
## Random selection: 1056 (equal to 93.2 %)
## Degree centrality: 1052 (equal to 92.85 %)
## PageRank: 1066 (equal to 94.09 %)
We see that at such high beta values, almost everyone is infected regardless of the selection method. As we are above the critical threshold, it becomes an epidemic regardless of who starts it.
b) The difference in the time of the peak of infection (when most infections happen).
First, aggregate the data into one table from the 3 outputs
## line plot with both matched, or 2 plots if we show each of susceptible, infected and recovered
# add labels to merge the DF
random_sim$type <- "Random"
degree_sim$type <- "Degree centrality"
pagerank_sim$type <- "PageRank"
results_q3 <- rbind(random_sim, degree_sim, pagerank_sim)
table(results_q3$type)
We now have a long table with our 3 simulations.
## aggregate the number of infections by each selection method first then plot
results_q3 |>
group_by(type, t) |>
summarise(sum_inf = n(), .groups = 'drop') |>
ggplot(aes(x = t, y = sum_inf, group = type, colour = type)) +
geom_line() +
labs(title = "Comparison of infection trends with β=15%, targeted vs random selection",
y = "Number of infections",
x = "Time period",
colour = "Selection method") +
theme_minimal() + # Or your preferred theme
  theme(axis.title.y = element_text(angle = 90))
Summary: we can show that targeted infection (i.e., the most influential people in the email system being the first to spread the rumor) makes the rumor spread faster than when random people in the network start it.
PageRank produces a slightly higher infection peak than the degree centrality measure, as hypothesised; however, the two measures are highly correlated.
The peaks are comparable across methods, however, because we have such a high beta value: each selection method peaks at roughly 220 to 230 infections on its biggest day of spreading. The only real difference is a delay in when the peak occurs.
# calculate the peak period in each
results_q3 |>
group_by(type, t) |>
summarise(sum_inf = n(), .groups = 'drop') |>
group_by(type) |> filter(sum_inf == max(sum_inf))## # A tibble: 3 × 3
## # Groups: type [3]
## type t sum_inf
## <chr> <dbl> <int>
## 1 Degree centrality 3 218
## 2 PageRank 3 231
## 3 Random 5 223
We see that the peak for PageRank is period 3 (231 infections), degree centrality also peaks in period 3 (218 infections), and the random allocation peaks in period 5 (223 infections). So targeted seeding accelerated the peak by 2 periods.
To see whether the peaks differ at other infection levels, we quickly re-run the model with β set to 3× the critical value, which is just over 1.6% and almost 10 times less infectious than in the plot above.
NOTE: We re-ran this with the lower infection rate (3× the critical threshold) because we weren’t sure whether the 15% infection rate we selected was too extreme to show any really useful results.
## comparing with a new, more moderate (but still high relative to the critical threshold) beta value
beta_high2 <- beta_c*3
beta_high2## [1] 0.01605344
random_sim <- sim_sir(graph, beta_high2, mu, random_seeds)
degree_sim <- sim_sir(graph, beta_high2, mu, degree_seeds)
pagerank_sim <- sim_sir(graph, beta_high2, mu, pagerank_seeds)
# add labels, join and check total values.
random_sim$type <- "Random"
degree_sim$type <- "Degree centrality"
pagerank_sim$type <- "PageRank"
results <- rbind(random_sim, degree_sim, pagerank_sim)
table(results$type) # we see the total infections are much lower. ##
## Degree centrality PageRank Random
## 519 540 504
## now plot again to compare the trends.
results |>
group_by(type, t) |>
summarise(sum_inf = n(), .groups = 'drop') |>
ggplot(aes(x = t, y = sum_inf, group = type, colour = type)) +
geom_line() +
labs(title = paste0("Comparison of infection trends with β=",round(beta_high2*100,2),"%, targeted vs random selection"),
y = "Number of infections",
x = "Time period",
colour = "Selection method") +
theme_minimal() + # Or your preferred theme
  theme(axis.title.y = element_text(angle = 90))
Here we see a similar trend but on a much smaller scale and with more visible noise. The degree-centrality run dies out by about period 40, just as the random assignment is reaching its peak values. Similarly, infections are slow to start for the random assignment, whereas the targeted samples peak immediately thanks to the importance of the initially infected nodes.
4) Using the same β, design a “quarantine strategy”: at time step t=3 or 4, quarantine of 20% the susceptible population. You can model quarantine by temporally removing these nodes. Release the quarantined nodes 8 time steps later, making them susceptible again. Measure the difference with respect to no quarantine.
First, we create a new function with the quarantine strategy (based on our previous regular SIR model), in which we quarantine 20% of the susceptible population at t=4 and release them at t=12.
sim_sir_quarantine <- function(g, beta, mu, seeds, quarantine_t = 4, release_t = 12){
state <- rep(0, vcount(g)) # 0 = S, 1 = I, 2 = R, -1 = Quarantined
state[seeds] <- 1 # set initial spreaders
t <- 0
table <- data.frame(t=0, inf=seeds)
quarantine_group <- c()
while(sum(state == 1) > 0){ # while there is still infected people
t <- t + 1
new_inf <- c()
# quarantine 20% of susceptible nodes at time t = quarantine_t
if(t == quarantine_t){
susceptible_nodes <- which(state == 0)
set.seed(t) # reproducibility
quarantine_group <- sample(susceptible_nodes, ceiling(0.2 * length(susceptible_nodes)))
state[quarantine_group] <- -1 # set them as "quarantined"
}
# release them back at time t = release_t
if(t == release_t && length(quarantine_group) > 0){
state[quarantine_group] <- 0 # become susceptible again
}
for(i in which(state == 1)){ # loop through infected nodes
neighbors <- neighbors(g, i)
for(j in neighbors){
if(state[j] == 0 && runif(1) < beta){
new_inf <- c(new_inf, j)
}
}
if(runif(1) < mu){
state[i] <- 2 # recover
}
}
new_inf <- unique(new_inf) # fix: avoid duplicate infections
state[new_inf] <- 1
if(length(new_inf) > 0){
table <- rbind(table, data.frame(t=t, inf=new_inf))
}
}
table
}
Then we simulate 20 runs each of the regular SIR model and the SIR model with the quarantine policy, to observe the differences in both the long run and the short run (during the quarantine period).
# set the very high beta level to same as q3
beta_high = 0.15
# Run simulation 20 times with quarantine
results_q <- replicate(20, {
sim <- sim_sir_quarantine(graph, beta_high, mu, seeds)
length(unique(sim$inf)) #register the total number of infected
})
# Run simulation 20 times without quarantine
results_no_q <- replicate(20, {
sim <- sim_sir(graph, beta_high, mu, seeds)
length(unique(sim$inf)) #register the total number of infected
})
# Compare average results
mean_q <- mean(results_q)
mean_no_q <- mean(results_no_q)
diff <- mean_no_q - mean_q
paste0("With quarantine, avg infected = ", round(mean_q),
"; without quarantine = ", round(mean_no_q),
" → Difference = ", round(diff), " nodes.")## [1] "With quarantine, avg infected = 1029; without quarantine = 1051 → Difference = 22 nodes."
Long-run outcomes with high beta value:
Surprisingly, after running 20 simulations for each strategy, the quarantine scenario resulted in almost the same total infections as the model without a quarantine policy (1029 vs 1051), a difference of only 22 fewer infections.
This suggests that the quarantine intervention is not effective at reducing spread in the long run: the beta value is so high that an epidemic occurs eventually whenever the rumor has time to spread freely, because the quarantine is only temporary. The following sections compare and provide more detail on the random allocation.
We will plot a box plot to compare the long run total infections for each model.
# tidy the data for later visualization
df_results <- data.frame(
infected = c(results_no_q, results_q),
strategy = rep(c("No Quarantine", "With Quarantine"), each = 20) # fix: 20 runs per strategy, not 10
)
# Visualize the difference of results with boxplot
ggplot(df_results, aes(x = strategy, y = infected, fill = strategy)) +
geom_boxplot(alpha = 0.5, width = 0.5, outlier.shape = NA) +
geom_jitter(width = 0.1, size = 2, alpha = 0.7) + # Each dot is one simulation
labs(title = "Comparison of Final Infection Counts, β = 0.15",
x = "Strategy",
y = "Total number of infected nodes") +
theme_minimal() +
theme(legend.position = "none")The results reveal that the median infection counts are very similar between the two strategies, indicating no strong difference in average outcomes. However, the no-quarantine group exhibits slightly more variability, with a few more extreme values on both ends.
Interestingly, the quarantine strategy does not significantly reduce infections — in fact, it slightly increases the average count in this particular setting. This may be due to reintroducing susceptible nodes back into the network at a time when the infection is still active, or due to randomly quarantining non-critical nodes.
Overall, this visualization suggests that random temporary quarantine of 20% of susceptible individuals from t = 4 to t = 12 is not an effective intervention at high infection rates (β = 0.15). Future strategies may benefit from more targeted approaches, such as quarantining high-centrality nodes.
Short-term quarantine outcomes with high beta value:
Apart from the final result, does the quarantine policy work throughout the quarantine? If it can slow the spread, we can at least show the quarantine works to some extent.
# We run the simulation 20 times again, as previously we only obtained the final infected number, but we didn't register the infected number at each stage
set.seed(123)
runs_no_q <- map(1:20, function(i) {
sim <- sim_sir(graph, beta_high, mu, seeds)
sim$run <- i
sim
})
runs_q <- map(1:20, function(i) {
sim <- sim_sir_quarantine(graph, beta_high, mu, seeds)
sim$run <- i
sim
})
# Join simulation data from both policies
all_runs_no_q <- bind_rows(runs_no_q) %>% mutate(strategy = "No Quarantine")
all_runs_q <- bind_rows(runs_q) %>% mutate(strategy = "With Quarantine")
all_runs <- bind_rows(all_runs_no_q, all_runs_q)
Now we calculate the cumulative infected number and its confidence interval in each time period for both policies, and visualize the results.
# summarise the findings to long form for plotting both with/without over time
cumulative_all <- all_runs %>%
group_by(strategy, run, t) %>%
summarise(new_inf = n(), .groups = "drop") %>%
group_by(strategy, run) %>%
arrange(t) %>%
mutate(cum_inf = cumsum(new_inf)) %>%
group_by(strategy, t) %>%
summarise(
mean_inf = mean(cum_inf),
sd_inf = sd(cum_inf),
ci_upper = mean_inf + 1.96 * sd_inf / sqrt(20), # fix: 20 runs per strategy, not 50
ci_lower = mean_inf - 1.96 * sd_inf / sqrt(20),
.groups = "drop"
)
# and actually plot them
ggplot(cumulative_all, aes(x = t, y = mean_inf, color = strategy, fill = strategy)) +
geom_line(linewidth = 0.4) +
geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper), alpha = 0.3, color = NA) +
geom_vline(xintercept = 4, linetype = "dashed", color = "darkred") +
geom_vline(xintercept = 12, linetype = "dashed", color = "darkblue") +
annotate("text", x = 4, y = 0, label = "Start quarantine (t=4)", angle = 90, vjust = -0.5,hjust=-1, color = "darkred", size = 3) +
annotate("text", x = 12, y = 0, label = "End quarantine (t=12)", angle = 90, vjust = -0.5, hjust=-1, color = "darkblue", size = 3) +
labs(title = "Infection Spread with quarantine, β = 0.15",
subtitle = "With vs Without Quarantine (20 simulations each)",
x = "Time step (t)",
y = "Cumulative number of infected nodes") +
theme_minimal() +
theme(legend.title = element_blank())
This plot illustrates the average cumulative number of infected nodes over time across 20 simulations, comparing the No Quarantine and With Quarantine strategies; the vertical dashed lines indicate the start (t = 4) and end (t = 12) of the quarantine period.
We observe that the quarantine period does reduce infections, with on average 139.35 fewer infections by period 12 (the end of the quarantine period). Growth in infections is still rapid under quarantine, however, so it does not flatten the rate of spreading very effectively.
# check differences in counts at end of quarantine
no_q_t12 <- cumulative_all |>
filter(t == 12) |> select(strategy, mean_inf) |> filter(strategy == "No Quarantine") |> pull(mean_inf)
yes_q_t12 <- cumulative_all |>
filter(t == 12) |> select(strategy, mean_inf) |> filter(strategy == "With Quarantine") |> pull(mean_inf)
paste0("The average infections without quarantine at time 12 was: ", no_q_t12, " while with quarantine, it reduced to ", yes_q_t12, " in time 12. Indicating a reduction of ", no_q_t12-yes_q_t12, " at the end of the quarantine period.")## [1] "The average infections without quarantine at time 12 was: 1007.25 while with quarantine, it reduced to 867.9 in time 12. Indicating a reduction of 139.35 at the end of the quarantine period."
As discussed above, in the long run an epidemic emerges either way, and there is little difference in the final totals.
This suggests that the random quarantine of 20% susceptible individuals has limited impact on the epidemic’s total spread under a high transmission rate (β = 0.15).
In conclusion, while the quarantine flattens the curve slightly, a more targeted strategy may be needed to produce meaningful reductions in total infections.
Testing with lower beta (still above threshold).
So far we have used a relatively high beta of 15%. We also wonder whether the quarantine policy would work better with a lower beta that still sits above the critical threshold. We therefore try a new beta of 1%, which remains above the critical threshold but is much lower than our previous 15%.
Here we reproduce our previous work but with beta of 1%.
# set a lower beta (the variable name beta_high is reused from the earlier code)
beta_high = 0.01
# Run simulation 20 times with quarantine
results_q <- replicate(20, {
sim <- sim_sir_quarantine(graph, beta_high, mu, seeds)
length(unique(sim$inf)) #register the total number of infected
})
# Run simulation 20 times without quarantine
results_no_q <- replicate(20, {
sim <- sim_sir(graph, beta_high, mu, seeds)
length(unique(sim$inf)) #register the total number of infected
})
# Compare average results
mean_q <- mean(results_q)
mean_no_q <- mean(results_no_q)
diff <- mean_no_q - mean_q
paste0("With quarantine, avg infected = ", round(mean_q),
"; without quarantine = ", round(mean_no_q),
" → Difference = ", round(diff), " nodes.")## [1] "With quarantine, avg infected = 265; without quarantine = 300 → Difference = 35 nodes."
Even in the long run, average total infections are lower with quarantine (265 vs 300 nodes), although the difference is modest. We test this more robustly:
# tidy the data for later visualization
df_results <- data.frame(
infected = c(results_no_q, results_q),
strategy = rep(c("No Quarantine", "With Quarantine"), each = 20) # 20 simulations per strategy
)
# Visualize the difference of results with boxplot
ggplot(df_results, aes(x = strategy, y = infected, fill = strategy)) +
geom_boxplot(alpha = 0.5, width = 0.5, outlier.shape = NA) +
geom_jitter(width = 0.1, size = 2, alpha = 0.7) + # Each dot is one simulation
labs(title = "Comparison of Final Infection Counts (β = 0.01)",
x = "Strategy",
y = "Total number of infected nodes") +
theme_minimal() +
theme(legend.position = "none")
We see that in the long run, quarantine appears to produce a slightly lower total infection count, but the difference is not clearly significant. At the lower end we also see some runs with very little spread, likely because the initial spreaders happened to be poorly connected nodes.
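To check whether the gap between the two strategies is statistically meaningful, a nonparametric test could be applied to the two vectors of final infection counts (`results_q` and `results_no_q`). A minimal sketch, using illustrative Poisson draws as stand-ins for the actual simulation results:

```r
# Sketch: Wilcoxon rank-sum test on final infection counts per strategy.
# The two demo vectors below are illustrative stand-ins for results_q / results_no_q.
set.seed(1)
results_q_demo    <- rpois(20, lambda = 265)  # hypothetical: with quarantine
results_no_q_demo <- rpois(20, lambda = 300)  # hypothetical: without quarantine

wt <- wilcox.test(results_q_demo, results_no_q_demo)
wt$p.value  # a small p-value would indicate the strategies differ beyond simulation noise
```

In the real analysis the same call would simply take the two `replicate()` outputs directly.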
# We run the simulation 20 times again, as previously we only obtained the final infected number, but we didn't register the infected number at each stage
set.seed(123)
runs_no_q <- map(1:20, function(i) {
sim <- sim_sir(graph, beta_high, mu, seeds)
sim$run <- i
sim
})
runs_q <- map(1:20, function(i) {
sim <- sim_sir_quarantine(graph, beta_high, mu, seeds)
sim$run <- i
sim
})
# Join simulation data from both policies
all_runs_no_q <- bind_rows(runs_no_q) %>% mutate(strategy = "No Quarantine")
all_runs_q <- bind_rows(runs_q) %>% mutate(strategy = "With Quarantine")
all_runs <- bind_rows(all_runs_no_q, all_runs_q)
And summarise + plot again for the trend over time.
cumulative_all <- all_runs %>%
group_by(strategy, run, t) %>%
summarise(new_inf = n(), .groups = "drop") %>%
group_by(strategy, run) %>%
arrange(t) %>%
mutate(cum_inf = cumsum(new_inf)) %>%
group_by(strategy, t) %>%
summarise(
mean_inf = mean(cum_inf),
sd_inf = sd(cum_inf),
ci_upper = mean_inf + 1.96 * sd_inf / sqrt(20),
ci_lower = mean_inf - 1.96 * sd_inf / sqrt(20),
.groups = "drop"
)
ggplot(cumulative_all, aes(x = t, y = mean_inf, color = strategy, fill = strategy)) +
geom_line(size = 0.4) +
geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper), alpha = 0.3, color = NA) +
geom_vline(xintercept = 4, linetype = "dashed", color = "darkred") +
geom_vline(xintercept = 12, linetype = "dashed", color = "darkblue") +
annotate("text", x = 4, y = 0, label = "Start quarantine (t=4)", angle = 90, vjust = -0.5,hjust=-1, color = "darkred", size = 3) +
annotate("text", x = 12, y = 0, label = "End quarantine (t=12)", angle = 90, vjust = -0.5, hjust=-1, color = "darkblue", size = 3) +
labs(title = "Infection Spread Over Time (β = 0.01)",
subtitle = "With vs Without Quarantine (20 simulations each)",
x = "Time step (t)",
y = "Cumulative number of infected nodes") +
theme_minimal() +
theme(legend.title = element_blank())+
lims(x=c(0,100))
While our earlier simulations with a high infection rate (β = 0.15) showed differences in infections during the quarantine period, there are minimal differences between the quarantine and no-quarantine scenarios under the lower infection rate (β = 0.01). This suggests that the current quarantine strategy does not effectively reduce spreading at low transmission rates.
There are several possible reasons:
Randomly quarantined nodes might not be important for transmission, so the main spread continues;
Letting quarantined people back in too early (at t = 12) may restart the infection;
When β is low, small changes can have big effects due to randomness.
Taken together, these results indicate that random quarantine is not always a good solution.
A better approach would be to design smarter quarantine strategies — for example, targeting central nodes or adapting quarantine timing based on the situation — to more consistently control the spread in different scenarios.
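As a sketch of such a targeted strategy, quarantine candidates could be ranked by degree rather than drawn at random. The snippet below illustrates only the selection step, on a small simulated graph (the demo graph, `q_frac`, and all variable names are illustrative assumptions, not the email network itself):

```r
library(igraph)

# Sketch: choose quarantine candidates by degree instead of sampling at random.
set.seed(42)
g_demo <- sample_pa(100, directed = FALSE)  # illustrative scale-free graph
q_frac <- 0.2                               # quarantine 20%, as in our policy

n_q <- ceiling(q_frac * vcount(g_demo))
q_random   <- sample(vcount(g_demo), n_q)                      # current (random) policy
q_targeted <- order(degree(g_demo), decreasing = TRUE)[1:n_q]  # targeted alternative

# The targeted set covers far more edges, so quarantining it disrupts more transmission paths.
sum(degree(g_demo)[q_targeted]) > sum(degree(g_demo)[q_random])
```

The targeted vector would then be passed to `sim_sir_quarantine()` in place of the random draw.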
5) Suppose now that you can convince 5% of people in the network not to spread that information at all.
If we have 5% who will not spread information at all, they are effectively immune. This means we can treat this 5% of people as if they had been vaccinated. Or in our situation, more like they had a realisation that spreading rumors over the university email is bad and refuse to read or share any gossip. In this section we compare whether targeted or random vaccination works more effectively.
As these people refuse to spread rumours, we call them kind_people.
a) Choose those 5% randomly in the network. Simulate the SIR model above βc using 1% of the remaining nodes as seeds. Choose those seeds randomly.
First we set our random 5% of nodes to make immune. These essentially are deleted, so we remove them from the model. We calculate these models with the critical threshold value (βc).
set.seed(1234)
# random vaccination campaign with 5% randomly selected - essentially delete them
kind_people_random <- sample(1:vcount(graph),
vcount(graph)*0.05)
graph_rand <- delete_vertices(graph, kind_people_random)
# print our findings for info
paste0("This random selection removed ", vcount(graph)-vcount(graph_rand), " nodes, and ", ecount(graph)-ecount(graph_rand), " edges from the graph")## [1] "This random selection removed 56 nodes, and 488 edges from the graph"
# now we set our 1% of seeds from the remaining nodes
inf_seeds <- sample(1:vcount(graph_rand),
vcount(graph_rand)*0.01)
# now create our random model
realization_random <- sim_sir(graph_rand,
beta = beta_c,
mu,
inf_seeds) %>%
group_by(t) %>% summarize(ninf=n())
realization_random |>
ggplot(aes(x = t, y = ninf)) +
geom_line() +
labs(title = paste0("Infection trend with randomly assigned nice people (non gossipers)"),
y = "Number of infections",
x = "Time period") +
theme_minimal() +
theme(axis.title.y = element_text(angle = 90))
b) Choose those 5% according to their centrality. Simulate the SIR model above βc using 1% of the remaining nodes as seeds. Choose those seeds randomly.
We will use PageRank to select the 5% as this had the most extreme peak in Question 3.
We see that this targeted removal also severs many more potential connections: over 1800 edges are removed from the graph, more than three times as many as under random selection of the 5% of nice people (vaccinated).
We see that removing the 5% most influential nodes is much more effective at killing any growth in infections. In fact, the infection dies out by period 15 and the spread of information stops.
set.seed(12345)
# targeted selection of the most influential 5% - essentially delete them, using our existing pagerank_cent calculations
num_seeds <- vcount(graph)*0.05
kind_people_pager <- order(pagerank_cent, decreasing = TRUE)[1:num_seeds]
graph_pager <- delete_vertices(graph, kind_people_pager)
# print our findings for info
paste0("This pagerank selection removed ", vcount(graph)-vcount(graph_pager), " nodes, and ", ecount(graph)-ecount(graph_pager), " edges from the graph")## [1] "This pagerank selection removed 56 nodes, and 1867 edges from the graph"
# now we set our 1% of seeds from the remaining nodes
inf_seeds <- sample(1:vcount(graph_pager),
vcount(graph_pager)*0.01)
# now create our pagerank-targeted model
realization_pager <- sim_sir(graph_pager,
beta = beta_c,
mu,
inf_seeds) %>%
group_by(t) %>% summarize(ninf=n())
#plot
realization_pager |>
ggplot(aes(x = t, y = ninf)) +
geom_line() +
labs(title = paste0("Infection trend with targeted nice people (non gossipers)"),
y = "Number of infections",
x = "Time period") +
theme_minimal() +
theme(axis.title.y = element_text(angle = 90))
c) Measure the difference between both cases as you did in step 3.
We first plot the two trends to compare the outputs over time, including the baseline infection case for comparison. We see that under both of these 5% kind-people scenarios, growth never became exponential, and infections were reduced compared to the baseline.
Random vaccination (i.e. random selection of kind people) is better than no vaccination (our baseline 1% random infection case), but the targeted approach (selecting the most connected people as kind) is much more effective and flattened the curve by period t = 6.
realization<- results |>
group_by(type, t) |>
summarise(sum_inf = n(), .groups = 'drop') |>
filter(type=="Random")
ggplot() +
geom_line(data=realization,aes(x=t,y=sum_inf,col="No Vaccination")) +
geom_line(data=realization_random,aes(x=t,y=ninf,col="Vacc. Random"))+
geom_line(data=realization_pager,aes(x=t,y=ninf,col="Vacc. Targeted"))
However, further simulations are needed to assess whether this effect holds consistently across multiple realizations or was the result of stochastic variance. Having considered this, we run 30 simulations instead of comparing just a pair of random results, to assess statistical significance. (We slightly increase the number of simulations here because we observed some unexpected variability in the results.)
set.seed(1234)
# set parameters for each run
beta_5c <- beta_c
mu_5c <- 0.1
runs_5c <- 30
kind_n <- ceiling(0.05 * vcount(graph))
#setup blank df
results_5c <- data.frame(
strategy = character(),
total_infected = numeric(),
peak_time = numeric()
)
# run the random model 30 sims
for (i in 1:30) {
kind_people_random <- sample(1:vcount(graph), kind_n)
graph_rand <- delete_vertices(graph, kind_people_random)
seed_n <- ceiling(0.01 * vcount(graph_rand))
inf_seeds <- sample(1:vcount(graph_rand), seed_n)
sim <- sim_sir(graph_rand, beta = beta_5c, mu = mu_5c, seeds = inf_seeds)
total_inf <- length(unique(sim$inf))
peak <- sim %>%
group_by(t) %>%
summarise(ninf = n()) %>%
filter(ninf == max(ninf)) %>%
pull(t) %>%
min() # in case of multiple peak times
results_5c <- rbind(results_5c, data.frame(
strategy = "Random",
total_infected = total_inf,
peak_time = peak
))
}
# run the pagerank 30 sims
for (i in 1:30) {
graph_pager <- delete_vertices(graph, kind_people_pager)
seed_n <- ceiling(0.01 * vcount(graph_pager))
inf_seeds <- sample(1:vcount(graph_pager), seed_n)
sim <- sim_sir(graph_pager, beta = beta_5c, mu = mu_5c, seeds = inf_seeds)
total_inf <- length(unique(sim$inf))
peak <- sim %>%
group_by(t) %>%
summarise(ninf = n()) %>%
filter(ninf == max(ninf)) %>%
pull(t) %>%
min()
results_5c <- rbind(results_5c, data.frame(
strategy = "PageRank",
total_infected = total_inf,
peak_time = peak
))
}
# combine the results and print summary
summary_5c <- results_5c %>%
group_by(strategy) %>%
summarise(
avg_infected = mean(total_infected),
sd_infected = sd(total_infected),
max_peak = max(total_infected),
peak_time = min(peak_time)
)
print(summary_5c)## # A tibble: 2 × 5
## strategy avg_infected sd_infected max_peak peak_time
## <chr> <dbl> <dbl> <int> <dbl>
## 1 PageRank 17.6 5.21 39 0
## 2 Random 28.2 13.2 58 0
##plot the total infections over each runs
ggplot(results_5c, aes(x = strategy, y = total_infected, fill = strategy)) +
geom_boxplot() +
geom_jitter(width = 0.1, size = 2, alpha = 0.7) + # Each dot is one simulation
labs(title = "Total Infected (30 runs): Random vs PageRank Kind People",
y = "Total number of infected nodes") +
theme_minimal()
Based on the results of 30 simulations, the PageRank-targeted vaccination strategy clearly outperforms the random vaccination strategy in reducing the spread of infection:
On average, the PageRank strategy resulted in 17.6 infected nodes, compared to 28.2 infected nodes under the random strategy. The maximum total infections was also higher under the random strategy (58 vs 39 for PageRank).
The standard deviation is also lower under PageRank (5.2 vs 13.2), suggesting more stable and consistent results.
The peak time was zero in both cases, meaning infections mostly happened early and then quickly subsided — but PageRank consistently kept the total lower. This is consistent with what we saw in the individual runs above. The plots show that the infection rate dies and no epidemics occur under these settings. These are not plotted again as the infection peak is zero in both instances, as seen in the plots in section 5a and 5b, plus the first plot in 5c.
These findings confirm that removing highly central nodes (according to PageRank) is an effective containment method. Vaccinating these central nodes breaks key transmission paths. In contrast, the random strategy may occasionally remove critical nodes, but also wastes many “vaccinations” on peripheral nodes, leading to less predictable and less effective results.
In short, when it comes to minimizing spread, targeted immunization based on network centrality is a better strategy than random selection.
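This can be illustrated directly through the degree moments: removing hubs collapses \({\langle k^2 \rangle}\), and hence the \({\langle k^2 \rangle}/{\langle k \rangle}\) factor that drives spreading. A minimal sketch on a simulated scale-free graph (an illustrative assumption, not the email network itself):

```r
library(igraph)

# Sketch: effect of random vs targeted 5% removal on <k^2>/<k>.
set.seed(7)
g <- sample_pa(1000, m = 2, directed = FALSE)  # illustrative scale-free graph

moment_ratio <- function(gr) { k <- degree(gr); mean(k^2) / mean(k) }

n_rm   <- ceiling(0.05 * vcount(g))
g_rand <- delete_vertices(g, sample(vcount(g), n_rm))                    # random 5%
g_targ <- delete_vertices(g, order(degree(g), decreasing = TRUE)[1:n_rm]) # top-degree 5%

c(original = moment_ratio(g),
  random   = moment_ratio(g_rand),
  targeted = moment_ratio(g_targ))
# removing the hubs lowers the ratio far more than random removal does
```

This is the same mechanism behind the \(R_0\) formula discussed in the conclusions below: targeted removal raises the effective epidemic threshold much more than random removal.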
7) With the results of step 2, train a model that predicts the time to infection of a node using their degree, centrality, betweenness, page rank and any other predictors you see fit. Use that model to select the seed nodes as those with the smallest time to infection in step 3. Repeat step 5 with this knowledge.
7.1. (Based on step 2) Predictive model
Firstly, we extract the results we obtained in step 2 as our data for training. For clean results, we average the time to infection for each node over 20 runs. A challenge here is that some nodes may never get infected.
# recreate the Q2 output to get the infection period for each node. This becomes our target variable. Average the infection period over 20 runs.
set.seed(1234)
# sim 20 times for more robust results
all_sims <- map_dfr(1:20, ~sim_sir(graph, beta=0.01, mu=0.1,
seeds = sample(1:vcount(graph), num_seeds)))
# Calculate average infection for each node
avg_inf_time <- all_sims |>
group_by(inf) |>
summarise(avg_inf_time = mean(t)) |>
rename(node_num = inf)
avg_inf_time## # A tibble: 1,092 × 2
## node_num avg_inf_time
## <int> <dbl>
## 1 1 23.8
## 2 2 18.5
## 3 3 15.9
## 4 4 20.9
## 5 5 24.7
## 6 6 34.1
## 7 7 19.8
## 8 8 25.7
## 9 9 23.6
## 10 10 35
## # ℹ 1,082 more rows
# check for uninfected nodes in any run.
paste0("We have ", length(avg_inf_time$node_num), " nodes infected, meaning that there are ", vcount(graph) - length(avg_inf_time$node_num), " nodes that were not infected at any stage in the 20 runs.")## [1] "We have 1092 nodes infected, meaning that there are 41 nodes that were not infected at any stage in the 20 runs."
# infection_times <- results |>
# filter(type=="Random") |> #we only use random infection data, as other infection methods were not used in the step 2 but other steps
# group_by(inf) |>
# summarise(first_infected = min(t)) |>
# rename(node = inf)We see that there are 103 missing nodes, we will create centrality metrics then impute the missing values to have a complete dataset based on known structures and centrality.
Then, we calculate the network metrics (degree, centrality, etc.) as inputs for modelling:
metrics_df <- data.frame(
node = as.numeric(V(graph)),
degree = degree(graph),
betweenness = betweenness(graph, normalized = TRUE),
closeness = closeness(graph, normalized = TRUE),
pagerank = page_rank(graph)$vector,
eigen_centrality = eigen_centrality(graph)$vector,
clustering = transitivity(graph, type="local", isolates="zero")
)
We join both data frames, then check imputation of missing values with mice. Unfortunately, because the missing nodes are poorly connected, imputation is ineffective: many receive an average infection time of 0, which is not realistic and caused problems in modelling. For this reason, we continue by removing the 41 nodes left uninfected after the 20 runs above.
# join 2 frames
data_7 <- left_join(metrics_df, avg_inf_time, by = join_by(node == node_num))
# count our missing nodes
data_7 |> filter(is.na(avg_inf_time)) |> nrow()## [1] 41
## IMPUTATION REMOVED.
# #impute those 41 values 5 times with mice
# imputed <- mice(data_7, m=5, method='pmm') # predictive mean matching
# data_7_imputed <- complete(imputed)
#
# #check for no missing values in imputed data
# data_7_imputed |> filter(is.na(avg_inf_time))
### REMOVE MISSING NODES - IN PLACE OF IMPUTATION.
data_7 <- data_7 |> filter(!is.na(avg_inf_time))
We now split the data, with 70% in our training set.
set.seed(123)
train_index <- sample(1:nrow(data_7), 0.7 * nrow(data_7))
train_data <- data_7[train_index, ]
test_data <- data_7[-train_index, ]
Before creating the predictive models, we would like to examine the distributions of and correlations among these variables:
We observed that many network metrics are strongly correlated with each other (such as degree and page rank), with clustering showing the lowest correlation. None appear highly correlated with time to infection. This multicollinearity could hurt the model, so we will run a lasso model for feature selection and compare it to a baseline model with all predictors.
Because of their skewed distributions, we convert betweenness, degree, clustering, pagerank and eigen centrality to their log values, based on visual inspection of the density plots above.
We first do a lasso model for feature selection. We exclude the node number from all models as this is numeric but not a predictor, just an ID variable.
# create X and y
X_train <- model.matrix(avg_inf_time ~ log(degree) + log(betweenness) + closeness + log(pagerank) +
log(eigen_centrality) + log(clustering), data = train_data)[, -1]
y_train <- matrix(train_data$avg_inf_time)
# Lasso model with cross-validation
lasso_model <- cv.glmnet(X_train, y_train, alpha = 1)
# check the best lambda
lasso_model$lambda.min## [1] 0.005139901
## 7 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -170.467819
## log(degree) 20.714339
## log(betweenness) .
## closeness -16.964849
## log(pagerank) -21.791805
## log(eigen_centrality) -1.525721
## log(clustering) .
From the coefficients, the lasso drops log(betweenness) and log(clustering), so we keep 4 predictors: degree, closeness, pagerank, and eigen_centrality. With this information, we now build a model. Our target variable, the time of infection, is continuous, so we fit a linear model.
# set base with all linear predictors
base_model <- lm(avg_inf_time ~ degree + closeness + pagerank + eigen_centrality + betweenness + clustering, data = train_data)
summary(base_model)##
## Call:
## lm(formula = avg_inf_time ~ degree + closeness + pagerank + eigen_centrality +
## betweenness + clustering, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.286 -8.653 -1.305 6.144 84.425
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.1566 7.2580 -1.813 0.070275 .
## degree -0.2888 0.5938 -0.486 0.626863
## closeness 131.2498 27.9247 4.700 0.00000309 ***
## pagerank 7618.4290 7160.8998 1.064 0.287717
## eigen_centrality -24.5101 13.5290 -1.812 0.070434 .
## betweenness -657.0723 268.1607 -2.450 0.014499 *
## clustering 7.9472 2.3112 3.439 0.000617 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.57 on 757 degrees of freedom
## Multiple R-squared: 0.07159, Adjusted R-squared: 0.06423
## F-statistic: 9.728 on 6 and 757 DF, p-value: 0.0000000002443
# fit a linear model with the 4 lasso-selected, logged predictors (reusing the name lasso_model)
lasso_model <- lm(avg_inf_time ~ log(degree) + closeness + log(pagerank) + log(eigen_centrality) , data = train_data)
summary(lasso_model)##
## Call:
## lm(formula = avg_inf_time ~ log(degree) + closeness + log(pagerank) +
## log(eigen_centrality), data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.623 -8.792 -1.488 6.360 83.002
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -180.509 50.115 -3.602 0.000336 ***
## log(degree) 21.537 3.723 5.785 0.0000000106 ***
## closeness -14.133 48.943 -0.289 0.772839
## log(pagerank) -22.774 4.588 -4.964 0.0000008526 ***
## log(eigen_centrality) -1.699 1.248 -1.361 0.173940
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.48 on 759 degrees of freedom
## Multiple R-squared: 0.08099, Adjusted R-squared: 0.07615
## F-statistic: 16.72 on 4 and 759 DF, p-value: 0.0000000000003809
We see that both models have low explanatory power overall (judging by R²), although both are statistically significant by p-value. It is not clear these will be good predictors.
Now we test the predictive power of both models:
# Base prediction
base_preds <- predict(base_model, newdata = test_data, type = "response")
#lasso prediction
lasso_preds <- predict(lasso_model, newdata = test_data, type = "response")
# MSE comparison
mean((test_data$avg_inf_time - base_preds)^2)## [1] 165.6918
## [1] 161.0754
The MSE of the lasso model is smaller than that of the base model, so we keep the lasso model as our best option.
7.2. (Based on step 3) Find the predicted 1% earliest infected nodes
Then, as in step 3, we again try to find a better set of 1% of seeds in the network, so that we infect more people than in the random case.
We also do a quick boxplot and see that the bulk of predicted infection times sits around 55-60, when the epidemic peaks. However, there are low outliers predicted to be infected first; these are the nodes we select as our seeds.
# add on our lasso predictions
data_7$predicted_time <- predict(lasso_model, newdata = data_7, type = "response")
# plot the infection distribution.
boxplot(data_7$predicted_time)
num_seeds <- ceiling(0.01 * vcount(graph)) #set the percentage of seeds at 1%
best_seed_nodes <- data_7 %>%
arrange(predicted_time) %>%
slice(1:num_seeds) %>% #select the nodes with earlier predicted infection time
pull(node)
print(best_seed_nodes)## [1] 1103 768 758 759 1125 1096 1133 1052 998 1099 514 160
7.3. (Based on step 5) Convince the 5% predicted to be infected earliest to be kind; compare with random and targeted vaccination
Then we replicate step 5, but selecting the 5% according to the lasso model (predicted seeding). We again run 30 simulations, as for the random and pagerank vaccinations.
We append our predicted-selection results to the question 5c output, where we ran 30 runs each of the random and pagerank vaccinations.
## we also calculate the 5% of nodes predicted to be infected earliest, for this question
num_seeds <- ceiling(0.05 * vcount(graph)) #set the percentage at 5%
pred_seed_nodes <- data_7 %>%
arrange(predicted_time) %>%
slice(1:num_seeds) %>% #select the nodes with earlier predicted infection time
pull(node)
### now we re-run Q5 with this info on the 5% predicted to be earliest to get infected, we compare them to random 5% selection and the 5% pagerank selection.
set.seed(1234)
# run the predicted model 30 sims to see total infections.
for (i in 1:30) {
graph_pred <- delete_vertices(graph, pred_seed_nodes)
seed_n <- ceiling(0.01 * vcount(graph_pred))
inf_seeds <- sample(1:vcount(graph_pred), seed_n)
# run with the repeated data from Q5
sim <- sim_sir(graph_pred, beta = beta_c, mu = mu_5c, seeds = inf_seeds)
total_inf <- length(unique(sim$inf))
peak <- sim %>%
group_by(t) %>%
summarise(ninf = n()) %>%
filter(ninf == max(ninf)) %>%
pull(t) %>%
min() # in case of multiple peak times
# 5c already exists, we just add on our results to a new df
results_5c <- rbind(results_5c, data.frame(
strategy = "Predicted",
total_infected = total_inf,
peak_time = peak
))
}
results_7 <- results_5c
results_7## strategy total_infected peak_time
## 1 Random 49 0
## 2 Random 13 0
## 3 Random 17 0
## 4 Random 14 0
## 5 Random 24 0
## 6 Random 24 0
## 7 Random 51 0
## 8 Random 23 0
## 9 Random 36 0
## 10 Random 29 0
## 11 Random 22 0
## 12 Random 20 0
## 13 Random 37 0
## 14 Random 13 0
## 15 Random 48 0
## 16 Random 18 0
## 17 Random 52 0
## 18 Random 13 0
## 19 Random 18 0
## 20 Random 19 0
## 21 Random 19 0
## 22 Random 14 0
## 23 Random 19 0
## 24 Random 58 0
## 25 Random 32 0
## 26 Random 32 0
## 27 Random 29 0
## 28 Random 43 0
## 29 Random 30 0
## 30 Random 29 0
## 31 PageRank 15 0
## 32 PageRank 13 0
## 33 PageRank 23 0
## 34 PageRank 13 0
## 35 PageRank 15 0
## 36 PageRank 14 0
## 37 PageRank 20 0
## 38 PageRank 19 0
## 39 PageRank 16 0
## 40 PageRank 13 0
## 41 PageRank 18 0
## 42 PageRank 19 0
## 43 PageRank 20 0
## 44 PageRank 20 0
## 45 PageRank 12 0
## 46 PageRank 12 0
## 47 PageRank 19 0
## 48 PageRank 12 0
## 49 PageRank 20 0
## 50 PageRank 15 0
## 51 PageRank 17 0
## 52 PageRank 39 0
## 53 PageRank 22 0
## 54 PageRank 14 0
## 55 PageRank 15 0
## 56 PageRank 17 0
## 57 PageRank 20 0
## 58 PageRank 17 0
## 59 PageRank 23 0
## 60 PageRank 16 0
## 61 Predicted 61 0
## 62 Predicted 24 0
## 63 Predicted 52 0
## 64 Predicted 24 0
## 65 Predicted 22 0
## 66 Predicted 18 0
## 67 Predicted 27 0
## 68 Predicted 35 0
## 69 Predicted 139 0
## 70 Predicted 23 0
## 71 Predicted 19 0
## 72 Predicted 36 0
## 73 Predicted 13 0
## 74 Predicted 22 0
## 75 Predicted 48 0
## 76 Predicted 22 0
## 77 Predicted 42 0
## 78 Predicted 15 0
## 79 Predicted 13 0
## 80 Predicted 18 0
## 81 Predicted 22 0
## 82 Predicted 29 0
## 83 Predicted 80 0
## 84 Predicted 26 0
## 85 Predicted 16 0
## 86 Predicted 61 0
## 87 Predicted 80 0
## 88 Predicted 61 0
## 89 Predicted 32 0
## 90 Predicted 60 0
Summarise the outputs:
# Summary
summary_7 <- results_7 %>%
group_by(strategy) %>%
summarise(
avg_infected = mean(total_infected),
sd_infected = sd(total_infected),
med_infect = median(total_infected),
max_peak = max(total_infected),
peak_time = min(peak_time)
)
print(summary_7)## # A tibble: 3 × 6
## strategy avg_infected sd_infected med_infect max_peak peak_time
## <chr> <dbl> <dbl> <dbl> <int> <dbl>
## 1 PageRank 17.6 5.21 17 39 0
## 2 Predicted 38 27.3 26.5 139 0
## 3 Random 28.2 13.2 24 58 0
We see that selecting kind people via our predicted infection times was not effective. It had the highest variability and actually a higher average than random selection. The median infections under the prediction model are lower than under random selection, implying that some outliers pulled up our model's mean. This is easier to see with a visualisation:
# Visualisations
ggplot(results_7, aes(x = strategy, y = total_infected, fill = strategy)) +
geom_boxplot() +
geom_jitter(width = 0.1, alpha = 0.6) +
labs(title = "Total Infected (30 runs) by different vaccination strategies")
Here, the results are quite similar to step 5: since we set beta at the critical value, all strategies have their peak time at t = 0 in every simulation. As we saw in Q5, with 5% of nodes deleted the vaccinated runs do not turn into an epidemic. This is illustrated below.
We see in the following plot that under the predicted model, the number of infections also drops off, but less rapidly than the random or targeted approach to select the 5%.
# plot the trend based on our tested data:
# inf seeds still set from graph_pred calc in 7.2
realization_pred <- sim_sir(graph_pred,
beta = beta_c,
mu,
inf_seeds) %>%
group_by(t) %>% summarize(ninf=n())
# plot comparing
ggplot() +
geom_line(data=realization_random,aes(x=t,y=ninf,col="Vacc. Random"))+
geom_line(data=realization_pager,aes(x=t,y=ninf,col="Vacc. Targeted"))+
geom_line(data=realization_pred, aes(x=t,y=ninf,col="Predicted nodes"))+
labs(title = "Plot comparing 5% exclusion",
subtitle = "Methods: Random, using pagerank and using predicted nodes")
Regarding the total infected number, the median is lower under targeted vaccination than under random vaccination.
Our model-based predictions perform worse than simply using standard connectivity measures. Overall, the predictive model picked up too much noise, as its high standard error shows, and was not very effective. This is less noticeable in Q5 because excluding 5% of nodes tends to work quite well even under random selection.
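One way to see why the predictions add little beyond connectivity is to check how strongly a predicted infection-time ranking agrees with a plain degree ranking. A sketch with synthetic data (the vectors below are illustrative assumptions, not our fitted values):

```r
# Sketch: Spearman agreement between degree and a noisy predicted infection time.
set.seed(3)
degree_demo    <- rpois(100, lambda = 5) + 1
# hypothetical predictions: earlier infection for higher degree, plus noise
predicted_demo <- 50 - 2 * log(degree_demo) + rnorm(100, sd = 2)

rho <- cor(degree_demo, predicted_demo, method = "spearman")
rho  # negative: higher-degree nodes are predicted to be infected earlier
```

If the agreement is strong, the model's seed ranking is essentially a noisy copy of a degree ranking, which would explain why it cannot beat the centrality-based selections.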
7.4. Conclusion
If we combine the insights from step 5 and step 7, we can conclude that, whichever vaccination strategy is taken (targeted or random), targeted selection of nodes was more effective than random selection.
This last conclusion is consistent with the mathematical results we learned in class. Targeted vaccination naturally deletes the high-degree nodes, which significantly reduces \({\langle k^2 \rangle}\). This effect is reflected in:
\[ R_0 = \frac{\beta}{\mu} \cdot \frac{\langle k^2 \rangle}{\langle k \rangle} \]
With lower \({\langle k^2 \rangle}\), \(R_0\) is reduced, meaning a lower probability of spreading;
And \[ g_c \approx 1 - \frac{\mu \langle k \rangle_{k \leq k_M}}{\beta \langle k^2 \rangle_{k \leq k_M}} \] With lower \({\langle k^2 \rangle}\), the critical vaccination threshold \(g_c\) is reduced, meaning that lower vaccination coverage is required to stop the spread.
Final conclusions
Throughout this project, we explored how rumors spread in a real-life university email network using the SIR model. We tried out different strategies—both for picking who starts the rumor and who we “vaccinate” (i.e., prevent from spreading it)—and found some interesting results:
First, we saw in action the epidemic threshold formula from class: \[ \beta_c = \mu \frac{\langle k \rangle}{\langle k^2 \rangle}\] When the infection rate β goes above this value, the rumor spreads like wildfire. Below it, nothing much happens. It was great to see this theory come to life in our simulation results.
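As a quick numeric check of this formula with μ = 0.1 (the recovery rate used throughout), degree moments of roughly \({\langle k \rangle} = 10\) and \({\langle k^2 \rangle} = 187\) (hypothetical values, chosen only to match the scale of the threshold in this report) give approximately the critical value we worked with:

```r
# Sketch: worked example of beta_c = mu * <k> / <k^2>.
mu      <- 0.1
k_mean  <- 10    # hypothetical <k>
k2_mean <- 187   # hypothetical <k^2>

beta_c_demo <- mu * k_mean / k2_mean
round(beta_c_demo, 5)  # ~0.00535, the order of magnitude used in this report
```

In practice, these moments would be computed directly from the graph, e.g. via `mean(degree(graph))` and `mean(degree(graph)^2)`.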
When it comes to who starts the spread, using central nodes (like those with high PageRank or degree) clearly makes the rumor spread faster and more widely than picking random people. That matched what we expected from network theory.
For vaccination, targeting the most central people again made a huge difference—it stopped the spread much more effectively than random vaccination. The idea that removing hubs slows down the spread really worked here.
In short, this project helped us test a lot of the mathematical concepts from class using real data and real simulations. It was cool to see how theory translates into actual strategies that work—and to see just how much network structure matters when it comes to spreading anything, whether it’s gossip, viruses, or ideas.
Highly connected nodes really can dominate the outcome of spreading success in models!
6) Comment on the relationship between the findings in steps 3 and 5 using the same type of centrality for the 1% in step 3 and 5% in step 5.
Here we compare our 2 PageRank methods applied in Q3 and Q5B. To summarise, we did:
In Q3 - apply a very high beta value (15%) above the critical value to maximise our infection spread and used centrality measures to select the 1% of infections using PageRank.
In Q5 - we select 5% of the population using PageRank to essentially treat them as kind_people who will not spread any rumours.
To compare the outputs we first outline differences in the methods:
Fewer initial infected nodes in Q5: because we first remove 5% of the nodes and then infect 1% of the remainder (per the question), we start with 10 infected nodes, whereas in Q3 we take the 1% sample from the whole population, giving 12 initial infected nodes.
Q3 aims to maximise spread by infecting the most connected nodes, while Q5 aims to prevent spread by removing the most connected nodes.
Q3 specifies a high beta value (we select \(\beta = 0.15\)), while Q5 specifies the critical beta value (\(\beta_c = 0.00535\)). This means our infectivity is around 28 times higher in Q3 than in Q5.
Overall, these 2 questions seek very different things but both highlight the importance of central nodes when considering the spread of rumors in the university email system. These central nodes could be considered the super-spreaders at risk of creating a more rapid epidemic.
Combining the findings, we see that who is infected initially is highly predictive of whether information spreads or dies out. The influential people in a social network hold great power: if they work together to spread a rumor over the university email, they will almost certainly succeed, while if they choose to disengage completely and be kind people, the rumor will not survive! Let's hope they are nice people.